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Abstract 

We investigate the relation between backbone and side-chain ordering in a small protein. 
For this purpose we have performed multicanonical simulations of the villin headpiece sub- 
domain HP-36, an often used toy model in protein studies. Concepts of circular statistics are 
introduced to analyze side-chain fluctuations. In contrast to earlier studies on homopolypep- 
tides (Wei et al., J. Phys. Chem. B, 111 (2007) 4244) we do not find collective effects lead- 
ing to a separate transition. Rather, side-chain ordering is spread over a wide temperature 
range. Our results indicate a thermal hierarchy of ordering events, with side-chain ordering 
appearing at temperatures below the helix-coil transition but above the folding transition. 
We conjecture that this thermal hierarchy reflects an underlying temporal order, and that 
side-chain ordering facilitates the search for the correct backbone topology. 



I. INTRODUCTION 

The process by which a protein folds into its biologically active state cannot be traced in 
all details solely by experiments. Fortunately, modern simulation techniques have opened 
another window, often leading to new insight into the dynamics and thermodynamics of 
foldingi'^i^ii^i^'^. Generalized ensemble techniques- such as parallel tempering^i^ii^ or multi- 
canonical samplingiiii^, first introduced to protein science in Ref. 13', have made it possible 
to study the folding of small proteins (with up to ~ 50 residues^) in silico. Of particular 
interest is whether there are different distinct transitions in the folding process, and what 
their thermal order and relation is. 

An example is the role of side-chain ordering. In recent studies on homopolymersi^ii^, 
we found for certain amino acids a de-coupling of backbone and side-chain ordering. The 
ordering did not depend on the details of the environment, i.e. whether the molecules were 
in gas phase or solvent, but solely on the particular side groups. It exhibited a transition-like 
character, marked by an accompanying peak in the specific heat. In the present work we 
extend this study to proteins, i.e. heteropolymers of amino acids. 

Our test protein is the villin head piece subdomain HP-36 with which we are familiar 
from earlier work— li^ii^. This molecule has raised considerable interest in computational 
biology— iSi as it is one of the smallest proteins (596 atoms) with well-defined secondary 
and tertiary structure^^ but at the same time still accessible to simulations^^. Its structure 
was resolved by NMR analysis and is shown in Fig. [1] as it is available in the Protein Data 
Bank— (PDB code Ivii). We use multicanonical sampling to study the thermal behavior of 
the protein in aqueous solvent over a wide range of temperatures from one single simulation. 
Such an approach is well-suited to overcome the problem of "slowness" of side-chain ordering 
observed in canonical simulations^^i^I. 

We observe that side-chain ordering occurs over a wide range in temperatures below the 
helix-coil transition. Although we do not find the collective effects leading to a separate side- 
chain ordering transition that were observed for homopolymers— ^i^, this result indicates that 
secondary structure formation is a necessary precursor for side-chain ordering. On the other 
hand, side-chain ordering occurs at higher temperatures than those at which the protein 
backbone assumes its native fold. We conjecture that HP-36 folds in a multi-step process, 
with side-chain ordering facilitating the search for the correct backbone topology. 



II. METHODS 

Our simulations utilize the ECEPP/2 force field^^ as implemented in the 2005 version of 
the program package SMMP— i^. Here the interactions between the atoms of the protein are 
approximated by a sum Eecepp/2 consisting of electrostatic energy Ec, a Lennard- Jones 
term E^j, hydrogen-bonding term Ehb and a torsion energy Etot'- 

-E'ecepp/2 = Ec + Elj + Ehb + Exor 
_ ^ 332gjg-,- 
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+ ^f/z(l±cos(nzeO), (1) 

I 

where Vij is the distance between the atoms i and j, C,i is the l-th torsion angle, and energies 
are measured in kcal/mol. The protein-solvent interactions are approximated by a solvent 
accessible surface term 

Esoiv = 22 ^*^« • (^) 

i 

The sum is over the solvent accessible areas Ai of all atoms i weighted by solvation parameters 



cr, as determined in Ref. 
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a common choice when the ECEPP/2 force field is utilized. 
Our previous experiences^!^ have shown that Egoiv reproduces the effects of protein-water 
interaction qualitatively correct. However, the temperature scale is often distorted, leading, 
for instance, to transitions at temperatures where water would be vaporized in nature. 
This problem can be remedied, however, by renormalization of the temperature scale upon 
comparison with experiments. 

The above defined energy function leads to a landscape that is characterized by a mul- 
titude of minima separated by high barriers. As the probability to cross an energy barrier 
of height AE is given by exp(— AE'/Z^^T), ks being the Boltzmann constant, it follows 
that extremely long runs are necessary to obtain sufficient statistics in regular canonical 
simulations at low temperatures. Hence, in order to enhance sampling we rely on the mul- 



ticanonical approach^*^ as described in Ref. |13|. Here, configurations are weighted with a 



non-canonical term wmu{E) usually determined iteratively to optimize certain properties 



of the simulation. Thermodynamic averages of an observable < O > at temperature T are 
obtained by re-weighting2^: 

JdxO{x)e'^(^y^-^/wMu[E{x)] 

where x counts the configurations of the system. 

Most often the multicanonical weight is determined such that the probability distribution 
obeys 

Pmu{E) cx n{E)wMu{E) ^ const , (4) 

where n{E) is the spectral density of the system. However, in our implementation we do not 
require a constant histogram but that the number of round-trips rirt between two pre-set 
low and high energy values Eiow and E^igh is maximal. E^igh is a an energy value typical 
for an disordered high temperature state (in our example Ehigh = -133.5 kcal/mol) while 
Eiow = —357 Kcal/mol was chosen to correspond to typical low-energy states as determined 
by us in preliminary studies. Obviously, the number of round-trips rij-t between the lowest 
and highest temperature, Eio^ and E^igh, respectively, is a lower bound for the statistically 
independent visits at the low energy states, and therefore a good measure for the efficiency 
of the simulation. For this reason, it is desirable to maximize the number of round trips by 
optimizing wmtjJE). T 



described in Refs. 
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"his can be achieved in a systematic way by the feedback algorithm 
j35l . The resulting weights are given as supplemental material. 

A simulation of five million Monte Carlo sweeps (each consisting of 217 Metropolis steps 
that try to update all 217 dihedral angles of the molecule once) leads to 35 tunneling events, 
i.e. at least 35 independent configurations with energies smaller than —357 kcal/mol. Every 
ten sweeps, we measure the energy E with its respective contributions from Eq. ([1]) and 
from the protein-solvent interaction energy Egoiv Other quantities measured are the radius 
of gyration Rgy as a measure of the geometrical size, and the number of helical residues nn, 
i.e. residues where the pair of dihedral angles {(pjip) takes values in the range (—70° ± 30°, 
—37° ±30°)^. Also we monitor the RMSD (root mean square deviation) of various subsets 
of heavy atoms (backbone, sidechain, all) from the PDB structure. 

Finally, all the 217 dihedral angles are recorded for later analysis of their fiuctuations 
and correlations. As the statistical analysis of dihedral angles has subtle pitfalls, we present 
and justify our approach in the Appendix. 



III. RESULTS AND DISCUSSIONS 

Multicanonical simulations allow determining thermodynamic quantities over a wide 
range of temperatures. The thermal evolution of the specific heat, for example, 

C{T) = ^E = ksP' {<E'>-<E >2) (5) 

provides information about the temperatures where the protein changes its state. In earlier 
investigationsi^i^ of homopolymers we observed two separate peaks in the specific heat for 
particular amino acids, characterizing two well-defined transitions. One peak was associ- 
ated with a helix-coil transition, i.e. the ordering of the protein backbone. The second 
peak, at a much lower temperature, could be related to an ordering of side chains. These 
results indicated a two-step folding process upon lowering the temperature, starting with 
backbone ordering followed by side-chain ordering. How does the situation look like for a 
heteropolymer such as HP-36? 

The specific heat curve in Fig. [2] has only one marked peak at T = 505 ± 8 K; but it also 
exhibits a shoulder around T=300K. As for the homopolymers, the peak in the specific heat 
can be related to a helix-coil transition. This interpretation is supported by the inset where 
we display the average number < uh > of residues that are part of an a-helix as a function 
of temperature. The steep increase in this quantity at T = 505K is clearly correlated with 
the peak in the specific heat. 

However, for HP-36 backbone ordering is more than just the formation of secondary 
structure. The average radius of gyration < Rgy >, a measure for the compactness of 
a protein configuration, as a function of temperature, is displayed in Fig. [31 It indicates 
that backbone ordering occurs in more than one step. Below T = 505 K, most protein 
configurations have a high helix propensity. Lowering the temperature further, compact 
structures become finally more frequent than extended configurations with equal or even 
higher helicity. This two-step process in the backbone ordering can also be seen in the inset 
which displays the fraction of configurations with a RMSD smaller than 6 A, i.e. those that 
should fall within the free energy basin of the native structure^. Final compactification and 
transition to nativeness are therefore concomitant processes. We believe that the shoulder 
in the specific heat is correlated with that final backbone ordering since it occurs close to 
the steepest parts of the decrease of < Rgy > and the increase of nativeness. 



Hence, our results so far indicate a two-step process, but one that involves only the 
backbone. The first step, correlated with a critical temperature T = 505 K involves the 
formation of helical segments. In a second step, these arrange themselves to compact and 
native-like structures. The energy gain here is much smaller, and therefore this second 
ordering step is observed at lower temperatures only. 

How does side-chain ordering fit into this picture? The behavior of the specific heat 
does not give any indications of a separate transition related to side-chain ordering. Such a 
transition could still exist — albeit not associated with large energy fiuctuations. A quantity 
that describes side-chain ordering in a very general way is the average of the fiuctuations of 
dihedral angles. We have calculated this quantity as described in the appendix for buried side 
chains and compare it with fiuctuations of angles belonging to side chains at the surface of 
the molecule. Both quantities are displayed in Fig. H] for all angles of a side chain, and in the 
inset solely for the Xi angle. For the buried residues one observes a single step ordering of the 
side chains. Immediately below the helix-coil transition the fiuctuations decrease, indicating 
that here the formation of helical segments leads already to some ordering of side chains. In 
the temperature range 300 — 500 K the fiuctuations decrease further, albeit less dramatic. 
This range corresponds to the shoulder in the specific heat and marks compactification and 
the formation of the tertiary backbone structure. Residues at the surface exhibit a much 
smaller decrease of fiuctuations associated with the formation of helical segments. 

At higher temperatures, side-chain ordering is restricted to residues in the interior of 
a protein. This is reasonable as here the side chain positions are more constraint by the 
geometry of the molecule. For this reason, we have focused our further analysis on side chain 
angles of residues in the interior of the molecule. Fig. [5] shows the fiuctuations of the Xi 
angle for the residues Phey, Phen and Pheis. Fluctuations of these angles decrease strongly 
over a small range of temperatures below the formation of the helical segments. We note 
that Phey exhibits ordering at a somewhat higher temperature than Phen and Pheig. 

The decrease in fiuctuations is only loosely related to an increase in correlations between 
the xi angles of these three residues, see Fig. El where the data were determined as described 
in the appendix. Phey exhibits correlated fiuctuations with Phen already close to the helix 
coil transition. They persist and increase finally in the low temperature phase. Phey and 
Phcis exhibit ( ant i-) correlations only below 350 K. The most dramatic change occurs with 
Phen and Pheis: Their correlations start to occur around 450 K, i.e. just when those 



angles are ordering; however, upon lowering the temperature the correlations switch to anti- 
correlations and increase in magnitude. 

Note, that all correlated fluctuations of these side chains exhibit their steepest change 
below 350 K, where Fig. [3] and its inset indicate the folding transition into the native 
backbone topology. On the other hand, this is the regime where angle fluctuations have 
subsided already. Hence, for HP-36 the correct ordering of the side chains seem to predate 
tertiary structure formation. These results also indicate that the final arrangement of the 
side chains occurs collectively. 

The above results indicate the following sequence of events in the folding of villin head- 
piece subdomain HP-36 upon lowering the temperature. The first stage is the formation 
of helical segments, connected with a large gain in potential energy. Below this helix-coil 
transition is a large intermediate temperature range where various helical configurations 
other than the native one dominate for entropic reasons. This temperature range is also 
characterized by an increased side-chain ordering that is more pronounced for side chains 
of residues in the interior that arrange themselves in coordinated way. The heterogeneity of 
the sequence seems to destroy the phase transition-like character of side-chain ordering that 
was observed by us for some homopolymers. Instead, the ordering is more gradual. Only at 
temperatures below side-chain ordering, and connected with a much smaller gain in energy 
than at the helix-coil transition, do the helical segments arrange themselves in native-like 
structures. 

Our results show a particular thermal order of the folding processes. It is natural to 
assume that this thermal order reflects a related temporal order of folding events. Hence, 
we conjecture that HP-36 folds in multi-step process where side chain and backbone ordering 
are interconnected. The initial step is the formation of helical segments. In a second step 
the protein collapses into more compact structures before it assumes its native state. This 
sequence of events is consistent with various computational^'^'^ and experimental^ studies 
that also identify the formation of helical segments as the time limiting factor in the folding 
of HP-36. New is our observation that the search for the correct structure seems to be 
facilitated by the ordering of side chains subsequent to secondary structure formation. This 
scenario is also consistent with recent mutagenesis experiments (relying on nanosecond laser 
T-jump measurements) that emphasize the importance of buried side chains for the rather 
short folding times of the villin headpiece^"^. 
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IV. SUMMARY AND OUTLOOK 

Choosing a well-studied small protein, the villin headpiece subdomain HP-36, we have 
presented methods that allow us to simulated and analyze ordering processes taking place 
on the level of the side chain dihedral angles as well as at the level of the backbone struc- 
tures. Our results indicate a thermal hierarchy of ordering events with side-chain ordering 
appearing at temperatures below the helix-coil transition but above the folding transition. 
We believe that the observed thermal hierarchy of folding reflects an underlying temporal 
sequence of these ordering processes in actual protein folding dynamics. We conjecture that 
side-chain ordering facilitates the search for the correct backbone topology. Further studies 
along these lines on different proteins will elucidate how general such a scenario is. 

Acknowledgments Support by a research grant (CHE-0313618) of the National Science 
Foundation (USA) is acknowledged. 

APPENDIX A: STATISTICAL ANALYSIS OF DIHEDRAL ANGLES 

Correct statistical analysis of dihedral angles is somewhat subtle because of their peri- 
odicity modulo 2-71. This property excludes the use of regular statistical measures like the 
mean angle, (a), or its variance, {{a — {a)Y). The reason is that the numerical values of 
those quantities depend on the reference frame chosen, e.g. [— vr, vr] or [0, 27r] or any other 
interval of length 27r. Moreover, choosing an inappropriate reference frame can lead, e.g. to 
the spurious appearance of a bimodal distributions from an underlying unimodal one. 

On the other hand, there exist the well-established mathematical fields of circular or 
directional statistic^^^^^ that deal with such problems. However, we believe that some the 
quantities and equations used there introduce unnecessary complications and do not fully 
reflect the underlying physical concepts. So, here we will borrow some ideas from that field, 
but we will not fully follow that approach. 

The important fundamental idea introduced in circular and directional statistics is that 
an angle a can be viewed as a two-dimensional vector of unit length 

a=f"f!) . (Al) 

\sm a / 



a concept that bears some similarity to, e.g., a spin in an XY-model^ treated in statistical 
physics. Consequently, we are interested in the mean direction a, also considered to be a 
unit vector. It can be determined from the averaged vector 

(a)=ff°f!>) . (A2) 

which is usually smaller than a unit vector, 

R^(a) = {cos{a)f + (sin(a))^ < 1 , (A3) 

by 

^ (a) (A4) 



R{a 
From this mean direction vector a corresponding mean angle a could be determined in an 

appropriate frame, 

S-("°f,*) . (A5) 

Notice that - as we will see below - most often it is not necessary to determine that angle. 
Rather, it is sufficient to work with either the mean vector (a), Eq. (\A2\i . or the mean 
direction vector a , Eq. (]A4|) . 

In this contribution we concentrate mostly on fluctuations and correlations between 
dihedral angles. Correlation analysis, in particular, is a somewhat complex field in the 
directional statistics literature, sometimes motivated and dominated by the fact that the 
underlying data are temporal and the goal is the detection of circadian rhythms'^^'^^. More- 
over, the quantities employed for describing fiuctuations do not always match up with those 
employed for describing correlations. Below we sketch the problems and justify our approach. 

The simplest measure for fiuctuations is based on the length of the average vector, 
Eq. (1A3p . The circular variance is given simply by 

V{a) = l-R{a) . (A6) 

V^ = corresponds to vanishing fiuctuations, while V = 1 describes the case of an equidis- 
tribution of angles over the full range, i.e. maximal fiuctuations. Interestingly, the circular 
variance can be derived, too, by considering the deviation vectors from the mean direction, 
i.e. 
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= l((a^)-2(a).a + a^ 

= 1-R{a) . (A7) 

Ideally, in order to systematically analyze correlations and fluctuations together, a co- 
variance function C (a,, aj) is necessary that generalizes the fluctuation measure employed. 
Combining the chosen covariance and fluctuation functions, the correlation matrix is finally 
given by 

p(a„a,) = ^£fef^ ■ (A8) 

p = denotes vanishing correlations, either since there are no fluctuations at all, or because 
the fluctuations are uncorrelated. p — > ±1 corresponds to full correlation or anti-correlation 
of the fluctuations, respectively. 

Unfortunately, a straightforward extension from Eqs. (1A6P and (lA7p . e.g. defining the 
covariance function as the scalar product of the respective deviation vectors from the mean 
direction, C (a^, aj) oc ((a^ — a^) ■ (a^ — aj)), is not possible. This quantity does not vanish 
if the angles are statistically independent, as it should for a proper covariance. Instead, 
replacing the deviations from the mean direction by the deviations from the mean vector 
does result in a seemingly proper covariance function, 

Cdiflf {ai, aj) = {{ai - (a^)) ■ (a^ - (a^))) . (A9) 

The related variance function differs from Eq. (1A6I) though, 

Kiiff(a) = (|a-(a)|n = (an-(a)' 



1 



(cos(a)) + (sin(a))^ 



= l-R^a) . (AlO) 

Both forms, V (a) and Vdiff (a), are related by a monotonic — albeit nonlinear — mapping 
and describe fluctuations in a qualitatively similar way. The only quantitative difference is 
that Eq. (]A10|) better resolves the small fluctuation regime while Eq. (1A6P does that with 



the regime of large fluctuations. 

While we do not consider the changed variance to be a problem, there is one with Eq. (1A9I) . 
Although Cdis («j,«j) exhibits the correct behavior in the limit of statistical independence 
of the angles, we have observed that problems arise in the regime of larger correlations. 
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This is due to the fact that |(aj)| 7^ \{^j)\ usually holds, which leads to an imbalance in the 
treatment of the respective deviation vectors. 
The authors of Ref. 



45l suggest to describe correlations between angles by the covariance 



function 



'-^sin \C>ii, Otjj 



(sin(aj — ai) sin(Q;j — Oij)) 



(All) 



This function also exhibits the correct behavior for independently distributed angles, and 
again - the related variance function differs from Eq. (lA6p . 



VliT, ia) 



'sm V"; — \Sin \Ci — a J 



(A12) 



Notice that this measure of fluctuations necessarily includes higher order moments of the 
angular trigonometric functions than those Eqs. flA6l) and flAlOl) use. Consequently, there 
does not exist a simple analytic mapping to the circular variance, and - particularly for large 
fluctuations - a non-monotonic relationship is possible^. 

We note that, as mentioned above, it is actually not necessary to determine the average 
angle a explicitly for evaluating Eq. (1A11|) . Rather, this equation also has a vector rep- 
resentation, albeit by using the cross product of vectors in addition to the scalar product. 
Extending a to a 3d vector via 



' cosfa) > 



sml^aj 




(A13) 



/ 



and using trigonometric identities, it can be easily seen that the the sine of the angle differ- 
ence is given by the z-component of the cross product (—ax a). Consequently, the covariance 
( lAllI) can be represented as 



Csin (aj, aj) = <^(ai X a^) ■ [k^ x a^ 



(A14) 



Analogously, the corresponding fluctuations are represented via 



V'sinCa) 



— 2 



a,; X a,; 



(A15) 



As outlined above, we would have preferred to systematically analyze fluctuations and 
correlations together, either using Eqs. flAlOl) and flA9l) . or Eqs. flA12l) and flAll|) . However, 
the problems with the covariance Eq. (lAQD — imbalance in the large correlations regime — 



12 



and the variance Eq. (1A12P — non-nionotonicity in the large fluctuations regime — do not 
allow this. 

Rather, we decided to employ a hybrid approach: When dealing with fluctuations we 
always rely on the circular variance, Eqs. flA6p since it is the simplest reliable approach. 
When dealing with correlations, we use the covariance Csin («i, «j), Eq. (lAlip . Necessarily, we 
have to employ the problematic variance Kjm(«), Eq. (]A12p . as normalization in determining 
the correlation function p{ai,aj), Eq. flA8l) . Since in our case correlations arise only in the 
regime where fluctuations are small, we feel that is an acceptable approach. It also outweighs 
the problems that arise from using Eq. ( JA9I) . 

We emphasize in closing that — to our knowledge — no satisfying approach exists yet to 
treat strong dihedral angle correlations in the large fluctuations regime. 
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Figure captions: 

Fig. 1: Structure of HP-36 (picture was obtained by VMD— ). 

Fig. 2: Specific lieat as a function of temperature. The inset displays the hehcity as a 
function of temperature. 

Fig. 3: Radius of gyration as a function of temperature. The inset shows the fraction of 
configurations with a backbone RMSD from the PDB structure less than 6 A. 

Fig. 4: Averaged fluctuations, Eq. ( 1A6I) . of side chain angles from buried and surface side 
groups, respectively; the inset shows average of only the Xi angle fluctuations. Note 
the errorbars denote the average of the errors in the fluctuations of each individual x 
angle. 

Fig. 5: Averaged fluctuations, Eq. flA6|l . of Xi for Phe7, Phell, and Phel8. 

Fig. 6: Correlations, Eq. flASp . based on Eqs. flAlip and flA12p . of Xi fluctuations between 
Phe7 and Phell, Phe7 and Phel8, and Phell and Phel8, respectively; see also the 
discussion in the appendix. 
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